1 Contributors

Member Department/Year Student ID Responsibilities
陳品瑄 Information Science, MS Year 1 112971018 Data profiling; R Markdown compilation
林宴葶 Information Science, MS Year 1 112971022 Data cleaning and feature engineering
傅國書 Information Science, MS Year 1 112971025 Model training and testing
張祐誠 Information Science, MS Year 1 112971013 Analysis workflow and code architecture design; code integration; report writing

2 Overview

Retention (keeping existing customers) is central to every company's Customer Relationship Management (CRM).

The retention rate measures how many customers a company keeps while it acquires new ones; a business can only grow sustainably if it retains its customers.

In this final project, we use the Kaggle Telco Customer Churn dataset to identify which customers are likely to stop using the company's services,

so that the company can run targeted retention campaigns before those customers churn.

2.1 Analytics Highlight

In this project, we implemented a model that detects which customers are likely to churn, a core task in customer relationship management.

We adopted XGBoost, a model with very strong predictive power, to make the predictions.

We evaluated the model with ROC, AUC, lift analysis, and a null-model comparison.

From the predictions, the company can target the customers most likely to churn with precision retention marketing.

2.2 Data Source

Kaggle - Telco Customer Churn

  • Business context

    Predict customer behavior in order to retain customers.

    All relevant customer data can be analyzed to develop focused customer retention programs.

  • Raw Data

    Each row represents a customer.

    Each column is an attribute describing that customer.

  • Detailed Information

    • Churn:

      Customers who left within the last month

    • Services each customer has signed up for:

      home phone service, multiple phone lines, internet service, add-on online security, online backup, a device protection plan for internet equipment, premium tech support, and streaming TV and movies

    • Customer account information:

      how long they have been a customer (tenure), contract, payment method, paperless billing, monthly charges, total charges, etc.

    • Demographic information:

      gender, age range, whether they live alone or with others, etc.

Column Name Description Remark
CustomerID A unique ID that identifies each customer String
Gender The customer's gender Male, Female
Age The customer's current age, in years, at the end of the fiscal quarter Number
Senior Citizen Whether the customer is 65 or older Yes, No
Married (Partner) Whether the customer is married Yes, No
Dependents Whether the customer lives with any dependents Yes, No
Number of Dependents The number of dependents who live with the customer Number
Phone Service Whether the customer subscribes to home phone service with the company Yes, No
Multiple Lines Whether the customer subscribes to multiple phone lines with the company Yes, No
Internet Service Whether the customer subscribes to internet service with the company No, DSL, Fiber Optic, Cable
Online Security Whether the customer subscribes to the company's add-on online security service Yes, No
Online Backup Whether the customer subscribes to the company's add-on online backup service Yes, No
Device Protection Plan Whether the customer subscribes to the company's add-on protection plan for their internet equipment Yes, No
Premium Tech Support Whether the customer subscribes to the company's add-on tech support plan with reduced wait times Yes, No
Streaming TV Whether the customer uses their internet service to stream TV programs from a third-party provider Yes, No
Streaming Movies Whether the customer uses their internet service to stream movies from a third-party provider Yes, No
Contract The customer's current contract type Month-to-Month, One Year, Two Year
Paperless Billing Whether the customer has chosen paperless billing Yes, No
Payment Method How the customer pays their bill Bank Withdrawal, Credit Card, Mailed Check
Monthly Charge The customer's current total monthly charge for all services from the company Number
Total Charges The customer's total charges, calculated to the end of the quarter specified above Number
Tenure How long the customer has subscribed to the company's service Number
Churn Yes = the customer left the company this quarter; No = the customer remained Yes, No

2.3 Analytics Target

  • Which customer characteristics are associated with churn?

  • Which customers will churn, and how accurately can we predict it?


3 Data Profiling

3.1 Introduce the Data

3.1.1 introduce

intro <- introduce(dataset)
kable(intro)
rows  columns  discrete_columns  continuous_columns  all_missing_columns  total_missing_values  complete_rows  total_observations  memory_usage
7043  21       17                4                   0                    11                    7032           147903              1641832
  • rows = 7043

  • columns = 21

  • discrete_columns = 17

  • continuous_columns = 4

  • all_missing_columns = 0

  • total_missing_values = 11

  • complete_rows = 7032

  • total_observations = 147,903

  • memory_usage = 1,641,832 (Bytes)

3.1.2 plot_intro

Basic description of the data:

  • columns (features) : discrete columns, continuous columns, all-missing columns

  • complete rows (customers) : number of customers with complete records

  • missing observations : number of missing values

plot_intro(dataset)

3.1.3 plot_missing

Show only the columns that contain missing values.

plot_missing(dataset, missing_only = TRUE)

3.2 Discrete

3.2.1 plot_bar

  • Discrete variables

  • Colored by Churn
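The corresponding DataExplorer call is not shown in this section; a sketch, assuming the same `dataset` object used earlier and the package's `by` argument for splitting by a discrete feature, would be:

```r
# Bar charts for every discrete column, broken down by Churn
plot_bar(dataset, by = "Churn")
```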

3.3 Continuous

3.3.1 plot_histogram

  • Continuous variables

    • Monthly charges

    • Tenure (subscription length)

    • Total charges
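The call producing these histograms is not shown; assuming the same `dataset` object as in the earlier profiling calls, it would be:

```r
# Histograms of the continuous columns (MonthlyCharges, tenure, TotalCharges)
plot_histogram(dataset)
```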

3.3.2 plot_boxplot

  • Continuous variables
    • Monthly charges
    • Tenure (subscription length)
    • Total charges
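The call producing these boxplots is not shown; a sketch, assuming the same `dataset` object (DataExplorer's `plot_boxplot` requires a `by` grouping variable):

```r
# Boxplots of the continuous columns grouped by Churn
plot_boxplot(dataset, by = "Churn")
```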

3.4 PCA

3.4.1 plot_prcomp
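This subsection has no code; a sketch of the likely call, assuming the same `dataset` object (`plot_prcomp` cannot handle missing values, so complete cases are used here):

```r
# PCA on the features; rows with missing values are dropped first
plot_prcomp(na.omit(dataset))
```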


4 Data Cleaning

4.1 Overview

The steps involved in data cleaning:

  1. Check whether the data types are correct for each variable using str() function.

  2. Handling Missing Values:

    2.1. Perform KNN (K-Nearest Neighbors) imputation specifically for the “TotalCharges” variable.

  3. Standardizing Data (convert text to a consistent case):

    3.1. Conditionally transform values that start with “No” (e.g. “No phone service”, “No internet service”) and replace them with “No”.
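The steps above can be sketched as follows on a toy data frame (the VIM package and the `k` value are our assumptions, not necessarily the original implementation):

```r
library(VIM)  # assumed package providing kNN() imputation

# Toy stand-in for the Telco data (values assumed for illustration)
dataset <- data.frame(
  MultipleLines = c("No phone service", "Yes", "No"),
  tenure        = c(1, 34, 2),
  TotalCharges  = c(29.9, NA, 108.2)
)

# 1. Check whether the data types are correct
str(dataset)

# 2.1 KNN imputation for the TotalCharges column (k chosen for illustration)
dataset <- kNN(dataset, variable = "TotalCharges", k = 1, imp_var = FALSE)

# 3.1 Replace values that start with "No " (e.g. "No phone service") with "No"
is_chr <- vapply(dataset, is.character, logical(1))
dataset[is_chr] <- lapply(dataset[is_chr],
                          function(x) ifelse(startsWith(x, "No "), "No", x))
```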

4.2 Comparison of Original and Cleaned Data

  • Original Data
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No phone service" "No" "No" "No phone service" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
  • Cleaned Data
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No" "No" "No" "No" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...

5 Feature Engineering

5.1 Overview

The steps involved in feature engineering:

  1. Remove the “TotalCharges” column:

Removing ‘TotalCharges’ from the model is justified by its high correlation with ‘MonthlyCharges’ and ‘tenure’.

Since ‘TotalCharges’ is essentially the product of ‘MonthlyCharges’ and ‘tenure’,

it provides little additional independent information.

Including ‘TotalCharges’ in the model would introduce redundancy and potential multicollinearity.
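The regression and ANOVA output below can be reproduced with base R (the object name `model_with_totalcharges` follows the bullet; `data` is the cleaned data frame):

```r
# Regress TotalCharges on the two variables it is (approximately) derived from
model_with_totalcharges <- lm(TotalCharges ~ MonthlyCharges + tenure, data = data)
summary(model_with_totalcharges)  # high R-squared indicates mostly redundant information
anova(model_with_totalcharges)    # significance of each predictor
```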

  • model_with_totalcharges:
## 
## Call:
## lm(formula = TotalCharges ~ MonthlyCharges + tenure, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1941.8  -468.6   -96.3   493.9  3474.0 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -2152.3306    22.0491  -97.62   <2e-16 ***
## MonthlyCharges    35.8332     0.3016  118.80   <2e-16 ***
## tenure            65.2792     0.3696  176.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 737.9 on 7040 degrees of freedom
## Multiple R-squared:  0.894,  Adjusted R-squared:  0.894 
## F-statistic: 2.968e+04 on 2 and 7040 DF,  p-value: < 2.2e-16
  • ANOVA table:
## Analysis of Variance Table
## 
## Response: TotalCharges
##                  Df     Sum Sq    Mean Sq F value    Pr(>F)    
## MonthlyCharges    1 1.5334e+10 1.5334e+10   28164 < 2.2e-16 ***
## tenure            1 1.6988e+10 1.6988e+10   31201 < 2.2e-16 ***
## Residuals      7040 3.8331e+09 5.4447e+05                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table indicates that both “MonthlyCharges” and “tenure” have a significant influence on “TotalCharges”.

Significance is determined by the associated p-values, which are extremely low here.

A low p-value provides strong evidence against the null hypothesis,

indicating that the observed relationship between the predictor variables and the outcome variable is unlikely to be due to chance alone.

  2. Encoding Categorical Variables:

  • Label Encoding: converting categorical data to numbers where the order matters.

  • One-Hot Encoding: converting categorical data to a binary (0 or 1) format.

  3. Scaling and Normalization:

  • Min-Max Scaling: scaling data to a fixed range of 0 to 1.
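A minimal sketch of these three steps on a toy data frame (column names mirror the dataset; the exact encoding and scaling used in the project may differ):

```r
# Toy data frame standing in for the cleaned Telco data (values assumed)
data <- data.frame(
  Partner        = c("Yes", "No", "No"),
  Contract       = c("Month-to-month", "One year", "Two year"),
  MonthlyCharges = c(29.9, 57.0, 53.9)
)

# Label encoding: map a binary Yes/No column to 1/0
data$Partner <- as.integer(data$Partner == "Yes")

# One-hot encoding: expand Contract into 0/1 indicator columns
dummies <- model.matrix(~ Contract - 1, data = data)
data <- cbind(data[setdiff(names(data), "Contract")], dummies)

# Min-max scaling: rescale a continuous column to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
data$MonthlyCharges <- min_max(data$MonthlyCharges)

str(data)
```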

5.2 Comparison of Cleaned Data and Feature-engineered Data

  • Cleaned Data
## 'data.frame':    7043 obs. of  21 variables:
##  $ customerID      : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender          : chr  "Female" "Male" "Male" "Male" ...
##  $ SeniorCitizen   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner         : chr  "Yes" "No" "No" "No" ...
##  $ Dependents      : chr  "No" "No" "No" "No" ...
##  $ tenure          : int  1 34 2 45 2 8 22 10 28 62 ...
##  $ PhoneService    : chr  "No" "Yes" "Yes" "No" ...
##  $ MultipleLines   : chr  "No" "No" "No" "No" ...
##  $ InternetService : chr  "DSL" "DSL" "DSL" "DSL" ...
##  $ OnlineSecurity  : chr  "No" "Yes" "Yes" "Yes" ...
##  $ OnlineBackup    : chr  "Yes" "No" "Yes" "No" ...
##  $ DeviceProtection: chr  "No" "Yes" "No" "Yes" ...
##  $ TechSupport     : chr  "No" "No" "No" "Yes" ...
##  $ StreamingTV     : chr  "No" "No" "No" "No" ...
##  $ StreamingMovies : chr  "No" "No" "No" "No" ...
##  $ Contract        : chr  "Month-to-month" "One year" "Month-to-month" "One year" ...
##  $ PaperlessBilling: chr  "Yes" "No" "Yes" "No" ...
##  $ PaymentMethod   : chr  "Electronic check" "Mailed check" "Mailed check" "Bank transfer (automatic)" ...
##  $ MonthlyCharges  : num  29.9 57 53.9 42.3 70.7 ...
##  $ TotalCharges    : num  29.9 1889.5 108.2 1840.8 151.7 ...
##  $ Churn           : chr  "No" "No" "Yes" "No" ...
  • Feature-engineered Data
## 'data.frame':    7043 obs. of  25 variables:
##  $ customerID                            : chr  "7590-VHVEG" "5575-GNVDE" "3668-QPYBK" "7795-CFOCW" ...
##  $ gender                                : int  0 1 1 1 0 0 1 0 0 1 ...
##  $ SeniorCitizen                         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Partner                               : int  1 0 0 0 0 0 0 0 1 0 ...
##  $ Dependents                            : int  0 0 0 0 0 0 1 0 0 1 ...
##  $ tenure                                : num  -1.2774 0.0663 -1.2366 0.5142 -1.2366 ...
##  $ PhoneService                          : int  0 1 1 0 1 1 1 0 1 1 ...
##  $ MultipleLines                         : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ InternetService                       : int  1 1 1 1 2 2 2 1 2 1 ...
##  $ OnlineSecurity                        : int  0 1 1 1 0 0 0 1 0 1 ...
##  $ OnlineBackup                          : int  1 0 1 0 0 0 1 0 0 1 ...
##  $ DeviceProtection                      : int  0 1 0 1 0 1 0 0 1 0 ...
##  $ TechSupport                           : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ StreamingTV                           : int  0 0 0 0 0 1 1 0 1 0 ...
##  $ StreamingMovies                       : int  0 0 0 0 0 1 0 0 1 0 ...
##  $ ContractMonth.to.month                : int  1 0 1 0 1 1 1 1 1 0 ...
##  $ ContractOne.year                      : int  0 1 0 1 0 0 0 0 0 1 ...
##  $ ContractTwo.year                      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PaperlessBilling                      : int  1 0 1 0 1 1 1 0 1 0 ...
##  $ PaymentMethodBank.transfer..automatic.: int  0 0 0 1 0 0 0 0 0 1 ...
##  $ PaymentMethodCredit.card..automatic.  : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ PaymentMethodElectronic.check         : int  1 0 0 0 1 1 0 0 1 0 ...
##  $ PaymentMethodMailed.check             : int  0 1 1 0 0 0 0 1 0 0 ...
##  $ MonthlyCharges                        : num  -1.16 -0.26 -0.363 -0.746 0.197 ...
##  $ Churn                                 : int  0 0 1 0 1 1 0 0 1 0 ...

6 Model Training

library(readr)
library(ggplot2)
library(lattice)
library(caret)
library(xgboost)
library(pROC)
library(dplyr)
library(png)
data <- read_csv("../data/feature/feature_set.csv")

6.1 Separate the Data

Separate the data into training and testing sets.

n <- nrow(data)
data <- data[sample(n),]  # shuffle the rows
index <- createDataPartition(data$Churn, p = 0.8, list = FALSE)
train_data <- data[index, ] 
test_data <- data[-index, ]

6.2 Select Model

Using training set to train the xgboost model.

  • Why the XGBoost model?

    XGBoost (eXtreme Gradient Boosting) is a gradient-boosted tree model: it iteratively trains a sequence of decision trees,

    each fitted to the residuals of the previous one, and combines them to produce the final prediction.

  • Advantages of XGBoost:

    • High accuracy: XGBoost handles large datasets and complex patterns well, so it typically achieves high predictive accuracy.

    • Strong regularization: XGBoost supports L1 and L2 regularization, which helps prevent overfitting and improves generalization ability.

    • Fast execution: XGBoost is usually faster than other machine learning algorithms, especially on large datasets.

train_matrix <- xgb.DMatrix(data = as.matrix(train_data[-c(1, which(names(train_data) == "Churn"))]), label = train_data$Churn)
test_matrix <- xgb.DMatrix(data = as.matrix(test_data[-c(1, which(names(test_data) == "Churn"))]), label = test_data$Churn)

6.3 K-Fold

k-fold cross validation

  • objective: the model's objective function; binary:logistic performs binary classification using logistic regression

  • eval_metric: the evaluation metric; auc measures the model's predictive power on a binary classification problem

  • eta: the learning rate; eta = 0.1

  • max_depth: the maximum tree depth; this parameter limits the depth of each decision tree to avoid overfitting; max_depth = 6

  • nrounds: the number of boosting iterations; nrounds = 100

  • nfold: the number of cross-validation folds; nfold = 5

  • early_stopping_rounds: stop training early when the model shows no significant improvement for this many consecutive rounds, to avoid overfitting; early_stopping_rounds = 10

  • verbose: the verbosity of training output; verbose = 1 prints detailed messages, including each round's performance

params <- list(
  objective = "binary:logistic",
  eval_metric = "auc",
  eta = 0.1,
  max_depth = 6
)

nrounds <- 100
nfold <- 5
early_stopping_rounds <- 10
verbose <- 1

# using xgb.cv to do k-fold cross validation 
cv_result <- xgb.cv(
  params = params,
  data = train_matrix,
  nrounds = nrounds,
  nfold = nfold,
  early_stopping_rounds = early_stopping_rounds,
  verbose = verbose
)
## [1]  train-auc:0.861850+0.002622 test-auc:0.829396+0.008566 
## Multiple eval metrics are present. Will use test_auc for early stopping.
## Will train until test_auc hasn't improved in 10 rounds.
## 
## [2]  train-auc:0.866396+0.002240 test-auc:0.832197+0.008757 
## [3]  train-auc:0.870848+0.001493 test-auc:0.835939+0.007664 
## [4]  train-auc:0.873138+0.001255 test-auc:0.836643+0.007631 
## [5]  train-auc:0.875058+0.001747 test-auc:0.837335+0.008254 
## [6]  train-auc:0.877000+0.001650 test-auc:0.838641+0.008024 
## [7]  train-auc:0.878480+0.001298 test-auc:0.838886+0.008890 
## [8]  train-auc:0.880055+0.001322 test-auc:0.839104+0.008558 
## [9]  train-auc:0.882262+0.000998 test-auc:0.839055+0.009062 
## [10] train-auc:0.883946+0.000913 test-auc:0.839692+0.008913 
## [11] train-auc:0.885643+0.000876 test-auc:0.839822+0.009267 
## [12] train-auc:0.886983+0.001333 test-auc:0.839862+0.009231 
## [13] train-auc:0.888795+0.001047 test-auc:0.840534+0.009096 
## [14] train-auc:0.890040+0.000879 test-auc:0.840337+0.008964 
## [15] train-auc:0.891226+0.000934 test-auc:0.840541+0.008719 
## [16] train-auc:0.892547+0.001136 test-auc:0.840477+0.008969 
## [17] train-auc:0.893718+0.001242 test-auc:0.840612+0.009098 
## [18] train-auc:0.894905+0.001320 test-auc:0.840500+0.009055 
## [19] train-auc:0.896085+0.001342 test-auc:0.840624+0.008619 
## [20] train-auc:0.897161+0.001630 test-auc:0.840992+0.008518 
## [21] train-auc:0.898272+0.001697 test-auc:0.841022+0.008607 
## [22] train-auc:0.899461+0.001918 test-auc:0.841445+0.008243 
## [23] train-auc:0.900596+0.002054 test-auc:0.842186+0.008258 
## [24] train-auc:0.901434+0.001984 test-auc:0.842522+0.008213 
## [25] train-auc:0.902257+0.001977 test-auc:0.842682+0.008186 
## [26] train-auc:0.903131+0.001960 test-auc:0.842870+0.008094 
## [27] train-auc:0.903941+0.002026 test-auc:0.843201+0.007914 
## [28] train-auc:0.904706+0.002114 test-auc:0.843209+0.007670 
## [29] train-auc:0.905626+0.002138 test-auc:0.843460+0.007424 
## [30] train-auc:0.906504+0.002144 test-auc:0.843445+0.007235 
## [31] train-auc:0.907173+0.002194 test-auc:0.843835+0.007361 
## [32] train-auc:0.907886+0.002331 test-auc:0.843703+0.007475 
## [33] train-auc:0.908681+0.002412 test-auc:0.843989+0.007411 
## [34] train-auc:0.909276+0.002385 test-auc:0.844337+0.007236 
## [35] train-auc:0.909841+0.002356 test-auc:0.844412+0.007078 
## [36] train-auc:0.910339+0.002190 test-auc:0.844325+0.007001 
## [37] train-auc:0.910912+0.002235 test-auc:0.844114+0.006830 
## [38] train-auc:0.911259+0.002362 test-auc:0.844336+0.006830 
## [39] train-auc:0.911818+0.002438 test-auc:0.844126+0.006738 
## [40] train-auc:0.912281+0.002459 test-auc:0.844266+0.006671 
## [41] train-auc:0.912613+0.002451 test-auc:0.844309+0.006794 
## [42] train-auc:0.913030+0.002411 test-auc:0.844194+0.006784 
## [43] train-auc:0.913563+0.002412 test-auc:0.844052+0.006794 
## [44] train-auc:0.914048+0.002631 test-auc:0.843940+0.006874 
## [45] train-auc:0.914568+0.002769 test-auc:0.843938+0.006803 
## Stopping. Best iteration:
## [35] train-auc:0.909841+0.002356 test-auc:0.844412+0.007078
print(cv_result)
## ##### xgb.cv 5-folds
##   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
##  <num>          <num>         <num>         <num>        <num>
##      1      0.8618499  0.0026222501     0.8293963  0.008565975
##      2      0.8663957  0.0022402966     0.8321966  0.008756756
##      3      0.8708477  0.0014929684     0.8359394  0.007664069
##      4      0.8731384  0.0012546437     0.8366427  0.007631390
##      5      0.8750579  0.0017467107     0.8373349  0.008253984
##      6      0.8770001  0.0016499774     0.8386414  0.008024405
##      7      0.8784802  0.0012981178     0.8388861  0.008890248
##      8      0.8800547  0.0013218481     0.8391040  0.008558222
##      9      0.8822619  0.0009983678     0.8390551  0.009062321
##     10      0.8839459  0.0009134497     0.8396924  0.008913261
##     11      0.8856426  0.0008761277     0.8398215  0.009266839
##     12      0.8869827  0.0013332246     0.8398617  0.009231400
##     13      0.8887952  0.0010472191     0.8405342  0.009096263
##     14      0.8900403  0.0008791414     0.8403367  0.008963948
##     15      0.8912256  0.0009344499     0.8405413  0.008719098
##     16      0.8925473  0.0011364011     0.8404771  0.008969380
##     17      0.8937185  0.0012417045     0.8406116  0.009097518
##     18      0.8949047  0.0013195998     0.8404995  0.009055130
##     19      0.8960849  0.0013415834     0.8406240  0.008618556
##     20      0.8971608  0.0016302269     0.8409917  0.008517739
##     21      0.8982719  0.0016971787     0.8410223  0.008606604
##     22      0.8994605  0.0019179665     0.8414446  0.008243496
##     23      0.9005957  0.0020543188     0.8421858  0.008258360
##     24      0.9014341  0.0019840219     0.8425218  0.008213243
##     25      0.9022565  0.0019768277     0.8426823  0.008185547
##     26      0.9031308  0.0019599501     0.8428702  0.008094295
##     27      0.9039412  0.0020257121     0.8432008  0.007913971
##     28      0.9047063  0.0021140172     0.8432089  0.007670260
##     29      0.9056262  0.0021378421     0.8434603  0.007424200
##     30      0.9065042  0.0021437825     0.8434452  0.007234667
##     31      0.9071728  0.0021935108     0.8438354  0.007360786
##     32      0.9078858  0.0023305020     0.8437029  0.007474565
##     33      0.9086805  0.0024116239     0.8439887  0.007410970
##     34      0.9092762  0.0023850154     0.8443374  0.007236349
##     35      0.9098405  0.0023557097     0.8444121  0.007078215
##     36      0.9103394  0.0021895584     0.8443254  0.007001304
##     37      0.9109120  0.0022348397     0.8441142  0.006829744
##     38      0.9112591  0.0023615401     0.8443361  0.006830473
##     39      0.9118181  0.0024376915     0.8441261  0.006738451
##     40      0.9122808  0.0024588206     0.8442657  0.006671313
##     41      0.9126133  0.0024514499     0.8443092  0.006793853
##     42      0.9130296  0.0024110171     0.8441940  0.006783673
##     43      0.9135628  0.0024117891     0.8440515  0.006793743
##     44      0.9140484  0.0026305584     0.8439402  0.006873604
##     45      0.9145685  0.0027689516     0.8439385  0.006803341
##   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
## Best iteration:
##   iter train_auc_mean train_auc_std test_auc_mean test_auc_std
##  <num>          <num>         <num>         <num>        <num>
##     35      0.9098405    0.00235571     0.8444121  0.007078215
  1. High AUC: the best iteration reaches the highest test AUC of the entire run

  2. Reasonable standard deviation: the relatively small standard deviation indicates that model performance is stable across the cross-validation folds

By contrast, later iterations reach similar AUC scores but never significantly exceed the best iteration's,
and in some iterations the AUC actually declines.

For example:
test-auc decreases after the best iteration.
Moreover, after the best iteration, the training AUC (train-auc) keeps rising
while the test AUC shows no clear improvement and even fluctuates, which suggests the model is beginning to overfit the training data.
# Create DMatrix object
# data_matrix <- xgb.DMatrix(data = as.matrix(data[-c(1, which(names(data) == "Churn"))]), label = data$Churn)

# get the best iteration
best_nrounds <- cv_result$best_iteration

model <- xgb.train(
  params = params,
  data = train_matrix,
  nrounds = best_nrounds,
  verbose = TRUE
)

6.4 Predict Label and Probability

train_pred <- predict(model, train_matrix)
train_predicted_label <- ifelse(train_pred > 0.5, 1, 0)

test_pred <- predict(model, test_matrix)
test_predicted_label <- ifelse(test_pred > 0.5, 1, 0)

Use the model to predict on the training and testing sets, then convert the predicted probabilities into binary labels with a 0.5 threshold.

6.5 Compute the ROC

Compute the ROC of training and testing data

roc_train <- roc(train_data$Churn, train_pred)
roc_test <- roc(test_data$Churn, test_pred)

auc_train <- auc(roc_train)
auc_test <- auc(roc_test)

# Assuming roc_train_plot and roc_test_plot are ggplot objects for ROC curves
roc_train_plot <- ggroc(roc_train) + ggtitle("ROC Curve - Training Data")
roc_test_plot <- ggroc(roc_test) + ggtitle("ROC Curve - Testing Data")

# Save ROC curves
ggsave("roc_train.png", plot = roc_train_plot)
ggsave("roc_test.png", plot = roc_test_plot)

print(paste("AUC of training data: ", auc_train))
## [1] "AUC of training data:  0.902676364275942"
print(roc_train_plot)

print(paste("AUC of testing data: ", auc_test))
## [1] "AUC of testing data:  0.832333067305136"
print(roc_test_plot)

6.6 Null Model Comparison

  • Null model prediction

    Predict using the mean value of Churn in the training set

# Null model prediction: using the mean of Churn in the training set as the probability
mean_churn <- mean(train_data$Churn)
null_train_pred <- rep(mean_churn, nrow(train_data))
null_test_pred <- rep(mean_churn, nrow(test_data))
  • Compute the ROC of null model
# Compute the ROC of null model
roc_null_train <- roc(train_data$Churn, null_train_pred)
roc_null_test <- roc(test_data$Churn, null_test_pred)
auc_null_train <- auc(roc_null_train)
auc_null_test <- auc(roc_null_test)
  • Print AUC values for the null model
# Print AUC values for the null model
print(paste("AUC for null model on training data: ", auc_null_train))
## [1] "AUC for null model on training data:  0.5"
print(paste("AUC for null model on testing data: ", auc_null_test))
## [1] "AUC for null model on testing data:  0.5"
# Compare the two AUCs with DeLong's test
roc_test_xgb <- roc(test_data$Churn, test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc_test_null <- roc(test_data$Churn, null_test_pred)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
# DeLong's test
delong_test <- roc.test(roc_test_xgb, roc_test_null, method="delong")

# Print the test result
print(delong_test)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  roc_test_xgb and roc_test_null
## Z = 28.516, p-value < 2.2e-16
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.3094911 0.3551751
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.8323331   0.5000000
  • DeLong's test is a statistical method for comparing the areas under two correlated receiver operating characteristic (ROC) curves

    • Test statistic Z: the large value indicates a large difference between the two models' AUCs

    • p-value: indicates that the difference between the AUCs is statistically significant (conventionally, p < 0.05 is considered significant)

    • 95% confidence interval of the AUC difference: we are 95% confident that the true AUC difference lies within this interval; since the interval does not contain 0, this further supports the conclusion that the AUCs differ significantly

    • DeLong's test therefore shows a statistically significant difference between the AUCs of the two ROC curves

  • The large Z value, the extremely small p-value, and the confidence interval of the AUC difference all show that XGBoost significantly outperforms the null model

6.7 Feature Importance and Ranking

XGBoost computes these scores from each feature's contribution to reducing the loss across all trees.

They help identify which features contribute most to the model's predictive power.

importance_matrix <- xgb.importance(model = model, feature_names = colnames(train_data[-c(1, which(names(train_data) == "Churn"))]))
xgb.plot.importance(importance_matrix)

The Month-to-month level of the Contract feature has the highest importance score, making it the most important predictor.

6.8 Save the Results

1. Save the prediction result

train_results <- data.frame(customerID = train_data$customerID, label = train_predicted_label, probability = train_pred, groundtruth = train_data$Churn)

test_results <- data.frame(customerID = test_data$customerID, label = test_predicted_label, probability = test_pred, groundtruth = test_data$Churn)

write_csv(train_results, "train_predictions.csv")
write_csv(test_results, "test_predictions.csv")

head(test_results, 5)
##   customerID label probability groundtruth
## 1 4007-NHVHI     0  0.38441196           0
## 2 9909-IDLEK     0  0.30783439           0
## 3 9039-ZVJDC     0  0.22101326           0
## 4 6734-JDTTV     0  0.02281840           0
## 5 3507-GASNP     0  0.02314463           0

2. Save the feature importance result

write_csv(importance_matrix, "feature_importance.csv")
head(importance_matrix, 5)
##                          Feature       Gain      Cover  Frequency Importance
##                           <char>      <num>      <num>      <num>      <num>
## 1:        ContractMonth.to.month 0.39571682 0.13554263 0.01602959 0.39571682
## 2:                        tenure 0.18780881 0.20526466 0.25215783 0.18780881
## 3:                MonthlyCharges 0.12510372 0.19727723 0.31134402 0.12510372
## 4:               InternetService 0.11843122 0.12078279 0.03082614 0.11843122
## 5: PaymentMethodElectronic.check 0.02648883 0.05423052 0.05055487 0.02648883

3. Save the XGBoost model

xgb.save(model, "../model/churn_prediction_model.xgb")
## [1] TRUE

4. Save the ROC curve

# Assuming roc_train_plot and roc_test_plot are ggplot objects for ROC curves
roc_train_plot <- ggroc(roc_train) + ggtitle("ROC Curve - Training Data")
roc_test_plot <- ggroc(roc_test) + ggtitle("ROC Curve - Testing Data")

# Save ROC curves
ggsave("roc_train.png", plot = roc_train_plot)
## Saving 7 x 5 in image
ggsave("roc_test.png", plot = roc_test_plot)
## Saving 7 x 5 in image
print(roc_train_plot)

print(roc_test_plot)


7 Lift Analysis Introduction

Lift analysis is an evaluation technique commonly used in data science and machine learning, and is especially common in marketing and recommender systems.

Its main purpose is to measure how much a strategy, campaign, or model improves outcomes relative to random selection.

Key points of lift analysis:

7.1 Definition

\(Lift\) is a ratio: the rate at which the target behavior occurs when the strategy is applied, divided by the rate at which it occurs without it.

The formula is \[ Lift = \frac {P(B|A)}{P(B)}\]

where \(P(B \mid A)\) is the probability of \(B\) under condition \(A\), and \(P(B)\) is the unconditional probability of \(B\).
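As a toy numeric example (both rates assumed for illustration): if 40% of the customers in a targeted segment churn while the overall churn rate is 20%, the lift is 2.

```r
# Lift = P(B|A) / P(B)
p_b_given_a <- 0.40  # churn rate within the targeted segment (assumed)
p_b         <- 0.20  # overall churn rate (assumed)
lift <- p_b_given_a / p_b
lift  # 2
```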

7.2 Goal

\(Lift\) analysis helps determine whether a specific action or model has a positive effect on the outcome, and whether that effect significantly exceeds what random chance would produce.

7.3 Application

  • Marketing campaigns: analyze the effect of a specific campaign on purchase behavior.
  • Recommender systems: evaluate whether a recommendation algorithm effectively increases users' click-through or purchase rates.
  • Risk assessment: in finance, evaluate how effective a strategy is at reducing fraud.

7.4 Pros and Cons

  • Pros: intuitive and easy to understand and communicate; helps quickly identify the most effective strategies or customer segments.
  • Limitations: does not account for potential bias or external factors; a high \(Lift\) does not necessarily mean a high absolute benefit, especially when the base rate \(P(B)\) is very low.

(Credit: ChatGPT)

# sorting by probability
test_results <- test_results[order(-test_results$probability),]

head(test_results, 10)
##      customerID label probability groundtruth
## 466  3296-SILRA     1   0.8968905           1
## 972  6356-ELRKD     1   0.8941033           1
## 1366 8375-DKEBR     1   0.8929382           1
## 569  8062-YBDOE     1   0.8927506           1
## 230  0495-RVCBF     1   0.8844432           1
## 339  6416-TVAIH     1   0.8828868           1
## 359  5192-EBGOV     1   0.8817466           1
## 800  8740-CRYFY     1   0.8810892           1
## 802  1761-AEZZR     1   0.8810892           1
## 1045 5196-SGOAK     1   0.8788865           1
# Segment the customers into deciles by predicted probability
test_results$decile <- cut(test_results$probability, breaks=quantile(test_results$probability, probs=seq(0, 1, by = 0.1)), include.lowest=TRUE, labels=FALSE)

# Reverse the decile numbering so that decile 1 contains the highest probabilities
test_results$decile <- 11 - test_results$decile

head(test_results, 5)
##      customerID label probability groundtruth decile
## 466  3296-SILRA     1   0.8968905           1      1
## 972  6356-ELRKD     1   0.8941033           1      1
## 1366 8375-DKEBR     1   0.8929382           1      1
## 569  8062-YBDOE     1   0.8927506           1      1
## 230  0495-RVCBF     1   0.8844432           1      1
# Compute the actual response rate and lift for each decile
test_lift_df <- test_results %>%
  group_by(decile) %>%
  summarise(
    count = n(),
    num_responses = sum(label),
    response_rate = mean(label),
    lift = response_rate / mean(data$Churn)
  )

plot <- ggplot(test_lift_df, aes(x = as.factor(decile), y = lift)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Lift Chart", x = "Decile", y = "Lift") +
  theme_minimal()

# Save the plot with ggsave
ggsave("lift_chart.png", plot, width = 10, height = 6, dpi = 300)
img <- readPNG("lift_chart.png")
print(plot)

# Save the lift data to CSV
write.csv(test_lift_df, "lift_data.csv", row.names = FALSE)
head(test_lift_df, 10)
## # A tibble: 10 × 5
##    decile count num_responses response_rate  lift
##     <dbl> <int>         <dbl>         <dbl> <dbl>
##  1      1   141           141        1      3.77 
##  2      2   141           141        1      3.77 
##  3      3   141             4        0.0284 0.107
##  4      4   140             0        0      0    
##  5      5   141             0        0      0    
##  6      6   141             0        0      0    
##  7      7   140             0        0      0    
##  8      8   141             0        0      0    
##  9      9   137             0        0      0    
## 10     10   145             0        0      0